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OpenCL 


const char * program src = 


" kernelYn" 

"void pi( global float * out) \n" 
"{\n" 

" out[0] = 3.14159f;\n" 

"JAn" E 


int main(int argc, char ** argv) { 
// Insert device setup code here. 
program = clCreateProgramWithSource(context,1,&program src,NULL,&error) ; 
error = clBuildProgram(program,1,&device id,NULL,NULL,NULL) ; 


// Insert buffer allocation and program execution code here. 
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Clover 


Overview 


e Clover - hardware independent OpenCL API 
implementation. 


e Gallium Drivers - Hardware dependent userspace 
GPU drivers. 


* libClang - OpenCL C language frontend. 
* libLLVM - Hardware dependent codegen. 


e libclc - LLVM IR bitcode library with device builtin 
functions (e.g. sin, cos, min, max). 
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Clover 
API Status 


e OpenCL 1.1 Supported. 
e Missing OpenCL 1.2 Functions: 


e clEnqueueMigrateMemObjects(), 
clGetKernelArginfo(), clEnqueueFillBuffer(), 
clEnqueueFilllmage() 


e OpenCL C 1.2 supported via clang. 
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Clover 


Gallium - AMD 


e Missing image support in LLVM/libclc. 


e Open Source CTS making development easier. 


e cl khr fpl6 almost done. 


e Benefiting from shared compiler with ROCm. 
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Clover 


Gallium - nouveau 


Work in progress OpenCL C support. 
e OpenCL C > SPIR-V > NV machine code. 


Compute API implementation mostly in place. 
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libclc 
Status 


e nvptx - basic support. 


e amdgpu- well supported. 


LLVM 


e Missing OpenCL 1.2 functions: maxmag, minmag, nan, 


powr, remainder, remguo, rootn, tanpi, half cos, 
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GPU Code/IR 


half divide, half exp, half exp2, half_exp10, half log, 


half |og2, half_log10, half powr, half recip, 
half rsgrt, half sin, half sart, native logi0 


e Missing Image functions. 
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Pocl 
OpenCL 


Status 
Program 
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e Supports LLVM 5.0. OpenCL C 
* Approaching OpenCL 1.2 completeness. libClang «< Pocl 
e New Cuda backend. libLLVM me eai p 
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Beignet 
OpenCL 


Status 
Program 
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. LLVM 5.0. 
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e OpenCL 2.0. libClang™ 
N | LLVM IR Beignet 
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ROCm OpenCL 


Status 


e Supports ROCm compatible AMD hardware. 
e OpenCL 1.2 API with OpenCL C 2.0. 


e Function call support in progress. 


e Compiler optimizations for LLVM AMDGPU backend. 
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Clover-GSOC11 


Status 


Inspiration for Clover 


Development started in 2009. 


Goal was OpenCL over gallium API. 


Picked up as Google Summer of Code Project in 2011. 


e Gallium support dropped. 
e Focused on supporting CPU targets. 
Development stopped after 2011. 


Inspired clover in mesa, but the mesa code is a 
complete rewrite. 


OpenCL 
Program 


LLVM 


libClang 
libLLVM 


Clover-GSOC11 


OpenCL C 


Clover 
CPU Code A 


A 


Clover 


Builtin Lib 


O redhat. 


Tl-OpenCL 


Status 


OpenCL for DSP 
e Based on Clover-GSOC2011. 


e Still under active development. 


e Borrowed device library code from libclc/pocl. 


e OpenCL1.1 
e Some CPU support. 
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Shamrock 
Status 


Fork of Tl-OpenCL 


e Improved CPU support in Tl-OpenCL. 


e CPU support for OpenCL 1.2. 


e No commits for one year. 
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OpenCL 


Overview 


Lots of fragmentation 


4 Different device library implementations. 


3 Implementations supporting various subsets of 
AMD hardware. 


Beignet only conformant implementation (1.2). 
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1 Requires proprietary drivers 
2 HSA Compatible Hardware 
3 ROCm Compatible Hardware 
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OpenCL 


Overview 


Why so much fragmentation ? 

e Missing features lead to new implementation and 
prevent consolidation: 
e Clover missing CPU support. 
e POCL missing interface to open GPU drivers. 
e Clover/Gallium not supporting Intel GPUs. 

e Well-tested closed implementation open sourced 
e ROCmOpenCL 
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OpenCL 


Overview 


Other incentives for fragmentation: 


OpenCL ICD makes it easy for multiple 
implementations to co-exist. 


Leveraging LLVM/Clang for OpenCL C support 
makes new implementations easier. 
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OpenCL 


Vendor Status 


Vendor adoption has slowed for recent versions. 

* Current OpenCL version is 2.2, spec released May 
2017. 

e OpenCL 1.2 spec was released in November 2011. 

e NVIDIA achieved 1.2 conformance in May 2015. 

e Most SOC vendors support 1.2. 


e Only ARM and Qualcomm only SOC vendors to 
support 2.0. 


e | successful conformance submission in 2017: 


e ARM Mali OpenCL 2.0 


Version 


Intel 2.1 
AMD 2.0 
NVIDIA 1.2 
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OpenCL Implementation Rates 


Months to Conformance 
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OpenCL 


Open Source Outlook 


OpenCL 1.0 > 2.2 


Slow vendor adoption has been a positive: 


e Open source implementations not that far behind. 


Fragmentation still an issue going forward. 


Incentive for implementing OpenCL 2.0+ may be low. 
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OpenCL 


Future Direction 


‘OpenCL Next’ discussed at IWOCL 2017 


Convergence of OpenCL and Vulkan API. 
e Not clear what this means. 


Fine grained feature capabilities. 


e API subsets for different devices / use cases. 


More language flexibility. 
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OpenCL 


Open Source Outlook 


‘OpenCL Next’ could be good for Open Source implementations if: 


Full API implementation not required for 
conformance. 


Shared effort with Vulkan. 


Compiler frontends not part of API. 
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SPIR-V 


Status 


Translators to/from SPIR-V 


e Khronos SPIR-V LLVM 


Bi-directional SPIR-V to LLVM IR converter. 
Fork of LLVM 3.8. 


e clspv 


OpenCL C to SPIR-V. 


Stand-alone project. 


Uses libclang to compile OpenCL C to LLVM IR. 


Converts LLVM-IR to SPIR-V. 


OpCapbility Shader 
OpMemoryModel Logical Simple 
OpEntryPoint GLCompute %3 "main" 


OpExecutionMode %3 LocalSize 64 64 1 


OpTypeVoid 
OpTypeFunction %1 
OpFunction %1 None %2 
OpLabel 

OpReturn 
OpFunctionEnd 


X X X Æ 


FP W Nr 


https://github.com/KhronosGroup/SPIRV-Tools/blob/master/syntax.md 
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ROCm 


status 


GPU Runtime / Toolchain for AMD hardware 


e Full stack: Kernel Drivers, Userspace Drivers, 
toolchain. 


e Completely Open Source. 
e Low-level compute API for AMD. 
e Clang based toolchain: 

e Direct-to-ISA compilation. 

e Assembler / Disassembler. 


e HCC compiler for C++AMP/HCC language. 
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OpenMP 
Status 


#include <omp.h> 
#include <stdio.h> 
#include <stdlib.h> 


Open Source Runtimes int main (int argc, char *argv[]) 


int nthreads, tid; 


e GCC Runtime (libgomp). 


/* Fork a team of threads giving them their own copies of variables */ 
#pragma omp parallel private(nthreads, tid) 
{ 


e Clang Runtime (libomp). 
/* Obtain thread number */ 
tid = omp get thread num(); 


e Both runtimes have Cuda backends. ormer Elie Werle) Tron Ge = sev, tä); 


/* Only master thread does this */ 
if (tid == 0) 


nthreads = omp get num threads(); 
printf("Number of threads = %d\n", nthreads); 


i 
J /* All threads join master thread and disband */ 
5 
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Cuda 


Overview 


Single-source GPGPU 


e Low-level: Driver API. 
e Higher-level: Runtime API. 


e Compiler inserts API calls into program for launching 
kernel. 


#include <iostream> 


— global void axpy(float a, float* x, float* y) { 


y[threadIdx.x] = a * x[threadIdx.x]; 


int main(int argc, char* argv[]) { 


const int kDataLen = 4; 


float a = 2.0f; 
float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f}; 
float host_y[kDataLen]; 


// Copy input data to device. 

float *device x, *device y; 

cudaMalloc(&device x, kDataLen * sizeof(float)); 

cudaMalloc(&device y, kDataLen * sizeof(float)); 

cudaMemcpy (device x, host x, kDatalen * sizeof(float), 
cudaMemcpyHostToDevice) ; 


// Launch the kernel. 
axpy<<<1, kDataLen>>>(a, device x, device y); 


// Copy output data to host. 

cudaDeviceSynchronize(); 

cudaMemcpy (host y, device y, kDataLen * sizeof(float), 
cudaMemcpyDeviceToHost) ; 


// Print the results. 
for (int i = 0; i < kDataLen; ++i) { 
std::cout << "y[" << i << "] =" << host yli] << "An"; 


} 


cudaDeviceReset(); 
return 0; 
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Cuda 


Status 


Open Source Work 


e gdev 


e Kernel module, userspace API. 
e Last public commit 3 years ago. 
e Clang CUDA frontend 
e AMD HIP 


#include <iostream> 


— global void axpy(float a, float* x, float* y) { 


y[threadIdx.x] = a * x[threadIdx.x]; 


int main(int argc, char* argv[]) { 


const int kDataLen = 4; 


float a = 2.0f; 
float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f}; 
float host_y[kDataLen]; 


// Copy input data to device. 

float *device x, *device y; 

cudaMalloc(&device x, kDataLen * sizeof(float)); 

cudaMalloc(&device y, kDataLen * sizeof(float)); 

cudaMemcpy (device x, host x, kDatalen * sizeof(float), 
cudaMemcpyHostToDevice) ; 


// Launch the kernel. 
axpy<<<1, kDataLen>>>(a, device x, device y); 


// Copy output data to host. 

cudaDeviceSynchronize(); 

cudaMemcpy (host y, device y, kDataLen * sizeof(float), 
cudaMemcpyDeviceToHost) ; 


// Print the results. 
for (int i = 0; i < kDataLen; ++i) { 
std::cout << "y[" << i << "] =" << host yli] << "An"; 


} 


cudaDeviceReset(); 
return 0; 
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CUDA 


Future Directions in Open Source 


e Revive gdev. 

e Gallium state tracker. 

e Conversion tools to/from PTX. 
e Clang as a Cuda frontend. 


e Open source replacement for Cuda toolchain 
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Future Directions 


Summary 


Common trends for GPGPU API 
e Smaller, low level APIs: 

e OpenCL Next 

e ROCm 

e CUDA Device API 


* IR as driver input. 


e Device language specs not part of API definition. 
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